NoSQL

Back to Production-Engineering/Data-Stores

Database Normalization
Modelling Data
Object Oriented Design to Databases
One to Many relationships
Many to Many

DB architectural design, opposed to SQL. The main advantage of NoSQL over SQL is that it can model data in different ways than a relational model, which can improve performance. A disadvantage is that NoSQL can result in very complicated data models

can store structured or unstructured data, not requiring a predefined schema for data
while SQL databases require migrations for schema changes, NoSQL schemas can interate dynamically
strong for graph stores and document databases
can handle wide column stores, optimized for large queries and joins
SQL databases scale vertically, to ensure validations across data and for cross table joins
NoSQL auto-shard nicely by scaling horizontally with more servers
NoSQL is inherently less structured than SQL, storing information in only two columns, key and value

Database Normalization

In relational databases, it is the process of organizing columns and tables to reduce data redundancy. If multiple Books have the same genre, create a new table Genres with a row Sci-fi where genre attribute of the book points to this row.

Modelling Data

OO, we use classes and objects. In relational DBs, we use tables

Denormalization separates data into documents, of similar data

a document like a book is a hash of keys and values which can be saved into a collection of books
- something like genre, is stored as an array rather than a relation
- this is structured with a multi-key index rather than a second table
represented with high level entities with nested attributes
RDBMS use tables of rows, while NoSQL uses collections of documents
consider how data will be manipulated, and what the access patterns will be like (read/write ratios ...)
no joins

Object Oriented Design to Databases

mapping an OO modelled data, for example an inheritance tree, to RDBMS can end up being modelled as a shapes table, with each row having a shape type with some attributes that may be unique to that type of shape
- lots of columns with lots of nil data, very inefficient
- in NoSQL, query planner will overlook documents that are missing an attribute being queried through a find
- sparce indexes exclude documents without the indexed key

One to Many relationships

Ownership pattern determines which model we should use

Embedded Arrays in documents hold keys for related documents
- slice subsets of arrays
- uses foreign keys
Embedded trees nest documents
- more performance, since documents are nested directly inside
- harder to search, and hard to fetch partial results (must fetch all comments then filters)
Normalized approach using multiple collections (creating
- results in more subqueries (loses data locality)

Many to Many

Relational needs a 3rd join table (again less performant)

NoSQL, we can do one-sided where all categories are stored in embedded arrays in products

Cassandra

Distributed NoSQL DB.

distributed, so no single point of failure
gossip protocol is used
data is indexed with a memtable
writes are fairly cheap, and indexes greatly improve reads at the cost of writes-

Distributed database for managing large amounts of structured data

lives across multiple servers
provides highly available (resilient to outage), no single point of failure
provides a solution to distributing data across data centers and availability zones
uses a masterless "ring" design
doesn't have to use sharded architecture or master-slave models
because there is no master, and no single point of failure, it is truely continuous in uptime

Disadvantages:

doesn't handle transactions (local nor XA (aka 2 phase commit) )
this implies inconsistent indexing
actions are not totally ordered
data is structured around the queries made to the database, so it doesn't handle ad hoc queries well

Gossip Protocol

Communications protocol for spreading new information

through periodic, pairwise, inter-process interactions, nodes spread information in a robust way
information exchanged during interactions is bounded
broadcast_address facilitates node to node interaction, for the purpose of peer discovery within the cluster (uses a public address)
listen_address is for gossip, and is the address that Cassandra binds to

Durability through Storage -- TODO source

Writes first to the CommitLog, and then to the Memtable, overflowing into disk as an SSTable

fsync synchronizes file's in-core state with a storage device
durability is achieved through fsync, flushing the write-back cache to disk (ECE 222)
commitlog is append-only which obviates the need for random seeking
- CommitLogAllocator manages CommitLogSegment instances (each being a file on disk, with a sequence of serialized RowMutation objects)
- commitlog_sync can be periodic or batch which dictates how fsync to disk is called

Data distribution

It is important to spread data across the cluster uniformly

data location is chosen based on the hash of its primary key
primary keys is used to fetch data from a table, a simpke key uses a single column (just user id)
a composite/compound key is generated from more columns (group and user id)
- a compound key is broken into the partition key (the value of the first part of the arguments)
- cluster key is the rest of the arguments
- this concept more abstractly describes how the data locality is distributed in the Cassandra Cluster
  - partion key describes which node will hold respective data
  - cluster key organizes information on the node
- this is logically broken down based on what queries will look like